Migrating Language Resources from SGML to XML: The Text Encoding Initiative Recommendations

Authors

  • Syd Bauman
  • Alejandro Bia
  • Lou Burnard
  • Tomaz Erjavec
  • Christine Ruotolo
  • Susan Schreibman
Abstract

The Text Encoding Initiative (TEI), established in 1987, has been the largest effort to standardise the computer encoding of language resources. TEI chose SGML (Standard Generalized Markup Language) as its underlying standard, and in the years before the inception of XML, a number of projects encoded their data according to some SGML DTD, TEI-compliant or otherwise. These projects could now benefit from migrating their data to XML. Apart from validation, the most compelling reason for migration is the scarcity of SGML-aware software and the abundance of XML-based tools and related recommendations. However, although XML is a subset of SGML, migration is not a trivial process, especially for large holdings of legacy language resources. This is why, in 2002, the TEI Consortium established a Task Force (TF) on SGML to XML migration. The TF has now produced a number of reports that simplify and make explicit the conversion of SGML TEI (version P3) documents to XML TEI (version P4) documents. The reports are also relevant to a general audience of SGML users who are considering migrating their language resources to XML. This paper presents the recommendations made by the TF, concentrating on strategic considerations, the practical guide, and one case study: the conversion of the British National Corpus.

Similar resources

Migrating Language Resources from SGML to XML:

The largest effort in the area of standardisation of computer encoding of language resources has been the Text Encoding Initiative (TEI), established in 1987. TEI chose as its underlying standard SGML (Standard Generalized Markup Language), and in the years before the inception of XML, a number of projects encoded their data according to some SGML DTD, TEI compliant, or otherwise. These project...

Text Encoding Initiative Consortium A Gentle Introduction to XML

As originally published in previous editions of the Guidelines, this chapter provided a gentle introduction to ‘just enough’ SGML for anyone to understand how the TEI used that standard. Since then, the Gentle Guide seems to have taken on a life of its own independent of the Guidelines, having been widely distributed (and flatteringly imitated) on the web. In revising it for the present draft, ...

Unification of XML Documents with Concurrent Markup

Annotating multiple hierarchies with SGML-based markup systems is still one of the fundamental problems of text-technological research. Up to now, several solutions have been discussed (e.g. chapter 31 of the TEI Guidelines (Sperberg-McQueen and Burnard 1994) and Barnard et al. (1995)). Furthermore, some non-SGML-based approaches have been proposed (cf. Huitfeldt and Sperberg-McQueen (2001); T...

Complementary Approaches to Representing Differences Between Structured Documents

Structured documents: Documents can be represented as structures with a hierarchical arrangement of text and non-text nodes, where nodes are labelled by category names such as “paragraph” and “section”. Representing documents this way is a natural consequence of using the Standard Generalized Markup Language (SGML) to encode the content and form of documents [10, 11, 7]. SGML is widely used. HTM...

Lessons learned from using SGML in the Text Encoding Initiative

In April of 1994 the ACH-ALLC-ACL Text Encoding Initiative published Guidelines for Electronic Text Encoding and Interchange (Document TEI P3). SGML was used as the basis for the encoding scheme that was developed. Several innovative approaches to the use of SGML were devised during the course of the project. Three aspects of this innovation are documented in the paper. First, all of the tags a...

Publication date: 2004